Melody retrieval system

ABSTRACT

A music retrieval system which takes an input melody as the query. In one embodiment, changes or differences in the distribution of energy across the frequency spectrum over time are used to find breakpoints in the input melody in order to separate it into distinct notes. In another embodiment the breakpoints are identified based on changes in pitch over time. A confidence level is preferably associated with each breakpoint and/or note extracted from the input melody. The confidence level is based on one or more of: changes in pitch, absolute values of a spectral energy distribution indicator, relative values of the spectral energy distribution indicator, and the energy level of the input melody. The process of matching the input melody with songs in the music database is based on minimizing a cost computation that takes into account errors in the insertion and deletion of notes, and penalizes these errors in accordance with the confidence levels of the breakpoints and/or notes.

RELATED APPLICATIONS

This application claims priority from U.S. provisional application serial no. 60/188,730, entitled, Humming Search Music Recognition System, filed March 13, 2000, which application is hereby incorporated herein by reference.

COPYRIGHT NOTICE

A portion of the disclosure of this patent document contains material which is subject to copyright protection. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure, as it appears in the Patent and Trademark Office patent files or records, but otherwise reserves all copyright rights whatsoever.

FIELD OF INVENTION

The invention relates to the field of music retrieval systems and more particularly to retrieval systems which take a melody vocalized by a user as the query.

BACKGROUND OF INVENTION

With the proliferation of musical databases now available, e.g., through the Internet or jukebox machines, consumers now have ready access to individual songs or pieces of music available for purchase or listening. However, being surrounded by so much music, it is often difficult for a listener to catch or remember the title of a song or the artist's name. Nevertheless, if the song is of interest to the listener, he or she can often remember at least a portion of its musical melody. The following references disclose retrieval of information relating to audio data from a hummed or sung melody taken as a query: U.S. Pat. No. 6,121,530 (Sonoda); A. Ghias, J. Logan, D. Chamberlin, B. C. Smith, Query by Humming, Musical Information Retrieval in an Audio Database, Multimedia '95, San Francisco, pp. 231-236; N. Kosugi, Y. Nishihara, S. Kon'ya, M. Yamamuro, K. Kushima, Music Retrieval by Humming, Using Similarity Retrieval over High Dimensional Feature Vector Space, 1999 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, Page(s) 404-407; and P. V. Rolland, Raskinis, J-G Ganascia, Musical Content-based Retrieval, an overview of the Melodiscov Approach and System.

The invention provides an approach different from those described in the above-mentioned documents in identifying a musical composition in response to a query that is a melody.

SUMMARY OF INVENTION

The invention provides methods and systems for retrieving musical selections or data identifying musical selections based on a digital version of a melody which originated from a sound or electronic source, e.g., a person humming, singing, whistling or otherwise vocalizing the melody; a musical instrument's audio or electronic output; an analog or digital recording of the melody, etc. Breakpoints between notes are identified as are distinct notes represented by pitch. In addition, one or more confidence levels may be associated with the input melody.

A value or confidence level may be assigned to each breakpoint to provide a measure of confidence that the identified breakpoint is in fact a breakpoint. Similarly, a value or confidence level assigned to each note may provide a measure of confidence that the identified note is a single note, e.g., does not include two or more notes.

One aspect of the invention provides a method and system for converting a digitized melody into a series of notes. The method and system receive a digitized representation of an input melody, identify breakpoints in the melody in order to define notes therein, determine a pitch and beat duration for each note of the melody, and associate a confidence level with each breakpoint, or each note, or both.

The confidence levels associated with breakpoints and/or notes may be determined using different techniques, some of which are described herein.

In the preferred embodiment, segmentation of the input melody into distinct notes divided by breakpoints is based on changes or differences in the distribution of energy across the frequency spectrum over time. The confidence levels associated with each breakpoint and/or note may be based on changes in pitch, as well as absolute and relative values of a spectral energy distribution indicator.

One aspect of the invention provides a method and related system for converting a digitized melody into a sequence of notes. Generally speaking, the method involves estimating breakpoints in the input melody based on changes in the distribution of energy across the frequency spectrum over time. In the preferred embodiment, the melody is segmented into a series of frames. A spectral energy distribution (SED) indicator is computed for each frame and at least initial breakpoint estimates are derived based on the SED indicator. Notes are defined between adjacent breakpoints.

Another aspect of the invention provides another method and related system for converting a digitized melody into a sequence of notes. The method includes: segmenting the melody into a series of frames; computing the auto-correlation of each frame; estimating the pitch of each frame based on (i) a pitch period corresponding to a shift where the auto-correlation coefficient associated with the frame is relatively large and (ii) the closeness of the pitch estimate to estimates in one or more adjacent frames; and estimating breakpoints in the melody based on changes in the pitch estimates, wherein the notes are defined between adjacent breakpoints.

Another aspect of the invention provides a method and related system for identifying breakpoints in a digitized melody. The method includes: segmenting the melody into a series of frames; computing the auto-correlation of each frame; estimating the pitch of each frame based on (i) a pitch period corresponding to a shift where the auto-correlation coefficient associated with the frame is relatively large and (ii) the closeness of the pitch estimate to estimates in one or more adjacent frames; determining regions of said melody where pitch estimates are likely to be invalid; and identifying the breakpoints in the melody based on transitions between frames having valid pitch estimates and frames having invalid pitch estimates.

Other aspects of the invention relate to methods and systems for determining confidence levels for breakpoints and/or notes in a waveform representing a melody. These methods include segmenting the waveform into a series of frames, wherein adjacent breakpoints encompass one or more sequential frames, each note being defined between adjacent breakpoints. Then, at least one of the following three steps may be executed: (a) computing a spectral energy distribution (SED) indicator for each frame; (b) estimating the pitch of each frame; and (c) determining the energy level of each frame. The confidence levels may be based on any of the following three characteristics: (i) the SED indicator, (ii) changes in pitch, and (iii) the energy level.

An entry may be retrieved from a music database of sequences of pitches and beat durations in accordance with a match function that receives the digitized melody obtained from a melody source as described herein. A method and system for implementing the retrieval may determine a score for each entry based on a search which minimizes the cost of matching the pitches and beat durations of the melody and the entry, and which may be based on minimizing a cost computation which may take into account one or more note insertion and/or deletion errors and penalize the cost in accordance with confidence levels pertaining thereto.

Another aspect of the invention relates to a method and system of retrieving at least one entry from a music database, wherein each entry is associated with a sequence of pitches and beat durations. The method includes receiving a digitized representation of an input melody; identifying breakpoints in the melody in order to define notes therein; associating each breakpoint and/or note with a confidence level; and determining a pitch and beat duration for each note of the melody. Then, a score is determined for each database entry based on a search which minimizes the cost of matching the pitches and beat durations of the melody and the entry. The search considers at least one deletion or insertion error in a selected note of the melody and, in this event, penalizes the cost of matching based on the confidence level of the selected note or breakpoint associated therewith. At least one entry may then be presented to a user based on its score.

BRIEF DESCRIPTION OF DRAWINGS

The foregoing and other aspects of the invention will become more apparent from the following description of preferred embodiments thereof and the accompanying drawings, which illustrate, by way of example, the principles of the invention. In the drawings:

FIG. 1 is a system block diagram showing the major components of a music recognition system according to a preferred embodiment of the invention;

FIG. 2 is a functional block diagram showing the processing blocks of a melody-to-note conversion subsystem employed in the music recognition system of FIG. 1;

FIG. 3 is a schematic diagram illustrating some of the processing activities of the melody-to-note conversion subsystem with respect to a sample input melody;

FIG. 4A is a normalized energy spectrogram, plotted against time and frequency, of a sample input melody (which sample differs from the melody referenced in FIG. 3);

FIG. 4B is a graph of the normalized energy spectrum at a first time frame in FIG. 4A plotted against frequency;

FIG. 4C is a graph of the normalized energy spectrum at a second time frame in FIG. 4A plotted against frequency;

FIG. 5A is identical to FIG. 4A (and provided on the same drawing sheet as FIGS. 5B and 5C for reference purposes);

FIG. 5B is a graph of a spectral energy distribution indicator, computed in a first manner, which is based upon the spectrogram of FIG. 5A;

FIG. 5C is a graph of a “minimum measure”, as discussed in greater detail below, which is based on the spectral energy distribution indicator shown in FIG. 5B;

FIG. 6A is identical to FIG. 4A (and provided on the same drawing sheet as FIGS. 6B and 6C for reference purposes);

FIG. 6B is a graph of a spectral energy distribution indicator, computed in a second manner, which is based upon the spectrogram of FIG. 6A;

FIG. 6C is a graph of a “minimum measure”, as discussed in greater detail below, which is based on the spectral energy distribution indicator shown in FIG. 6B;

FIG. 7A is identical to FIG. 4A (and provided on the same drawing sheet as FIGS. 7B and 7C for reference purposes);

FIG. 7B is a graph of a spectral energy distribution indicator, computed in a third manner, which is based upon the spectrogram of FIG. 7A;

FIG. 7C is a graph of a “minimum measure”, as discussed in greater detail below, which is based on the spectral energy distribution indicator shown in FIG. 7B;

FIG. 8A is identical to FIG. 4A (and provided on the same drawing sheet as FIGS. 8B and 8C for reference purposes);

FIG. 8B is a graph of a spectral energy distribution indicator, computed in a fourth manner, which is based upon the spectrogram of FIG. 8A;

FIG. 8C is a graph of a “minimum measure”, as discussed in greater detail below, which is based on the spectral energy distribution indicator shown in FIG. 8B;

FIG. 9A is identical to FIG. 4A (and provided on the same drawing sheet as FIGS. 9B and 9C for reference purposes);

FIG. 9B is a graph of a spectral energy distribution indicator, computed in a fifth manner, which is based upon the spectrogram of FIG. 9A;

FIG. 9C is a graph of a “minimum measure”, as discussed in greater detail below, which is based on the spectral energy distribution indicator shown in FIG. 9B; and

FIG. 10 is a schematic diagram illustrating a process for matching notes.

DETAILED DESCRIPTION OF PREFERRED EMBODIMENTS

1. System Overview

FIG. 1 shows a music recognition system 10 which comprises four major components: a melody-to-note conversion subsystem 12; a music reference database 14; a note-matching engine 16; and an output subsystem 18.

The music recognition system 10 takes a digitized input melody 20 obtained from a source 11 as a query. For reasons explained in greater detail below, it is preferred that the input melody originate from a user in the form of humming, particularly through intonations of notes that are combinations of a semi-vowel, such as “l”, and a vowel, such as “a” (i.e., notes in the form of “la”). However, the input melody may also comprise many other forms of humming, singing, whistling or other such types of music-like vocalization. The input melody may also originate from a musical instrument(s). In these cases the source 11 represents circuitry for recording and digitizing the user's voice or the musical instrument. Alternatively, the input melody may originate from a recording of some kind, in which case the source 11 represents the corresponding player and, if necessary, any circuitry for digitizing the output of the player. The digitized input melody 20 is supplied to the melody-to-note conversion subsystem 12.

The melody-to-note conversion subsystem 12 converts the digitized input melody 20 into a sequence of musical notes characterized by pitch, beat duration and confidence levels. This is accomplished through spectral analysis techniques described in greater detail below which are used to find “breakpoints” in the input melody in order to separate it into distinct notes. The pitch of each note is determined by the periodicity of the input melody waveform between the note-defining breakpoints. The beat duration of each note is extracted from the separation of the notes, i.e., the duration is determined from the time period between breakpoints. To compensate for error in the separation, each breakpoint is preferably associated with a confidence level, which indicates how likely it is that the breakpoint is a valid breakpoint. A confidence level is preferably also associated with each note to indicate how unlikely it is that the identified note actually contains more than one note. The output of the melody-to-note conversion subsystem 12 is a differential note and timing file 150 which comprises the relative difference in pitch and the relative difference in beat duration of consecutive notes. The difference is preferably expressed in terms of the logarithm of the ratio of the pitch and duration values of the consecutive notes. The reason for using pitch and duration differences is discussed further below.

The music reference database 14 stores the differential note and timing files of all music or songs searchable by the system 10. Each such file preferably comprises a short, easily recognizable segment of a song or music, i.e., the so-called “signature melody”, but may alternatively encompass an entire song or piece of music. These files may be generated in the first instance by the melody-to-note conversion subsystem 12.

The note matching engine 16 compares the differential note and timing file 150 from the melody-to-note conversion subsystem 12 with songs or pieces of music in the music reference database 14, which are stored in a similar file format. Since different users may vocalize or play a song or piece of music in a different key and at a different tempo, the system 10 does not compare the pitch of the uttered melody and the reference files directly, but rather the ratio in pitch between consecutive notes. For the same melody, if the scale is shifted to a different frequency, the ratio in the frequency (pitch) of the consecutive notes will be the same. Similarly, to normalize for differences in tempo, the system 10 compares the relative duration of the consecutive notes. The note matching engine 16 employs dynamic programming techniques described in greater detail below for matching the differential note and timing file 150 with similarly formatted files stored in the music database 14. These techniques can compensate for pitch errors and insertions or deletions of notes by the user or the melody-to-note conversion subsystem 12. The engine 16 calculates a matching score for each song in the database 14.
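The key and tempo normalization described above can be illustrated with a minimal sketch. The function name `differential_encoding` and the sample pitch and duration values are hypothetical and are not taken from the specification; the sketch only assumes that each note is given as a pitch in Hz and a duration in seconds, and that the difference is the natural logarithm of the ratio between consecutive notes.

```python
import math

def differential_encoding(pitches_hz, durations_s):
    """Return the log-ratios of pitch and of duration between consecutive notes.

    Hypothetical helper: the specification only states that the difference is
    preferably the logarithm of the ratio of consecutive values.
    """
    pitch_diffs = [math.log(b / a) for a, b in zip(pitches_hz, pitches_hz[1:])]
    duration_diffs = [math.log(b / a) for a, b in zip(durations_s, durations_s[1:])]
    return pitch_diffs, duration_diffs

# The same three-note phrase, first as "stored", then hummed 1.5x higher
# in pitch and with every note shortened by the same factor (faster tempo).
original = differential_encoding([220.0, 247.5, 330.0], [0.5, 0.5, 1.0])
transposed = differential_encoding([330.0, 371.25, 495.0], [0.4, 0.4, 0.8])
```

Under these assumptions the two encodings agree to within floating-point error, which is why a query hummed in another key or at another tempo can still be compared against the stored file.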

The output subsystem 18 sorts the songs or music in the database 14 based on the matching scores. The highest ranked song(s) or piece(s) of music is selected for presentation to the user.

2. Melody to Note Conversion

2.1. Overview

FIG. 2 shows the functional blocks of the melody-to-note conversion subsystem 12. The subsystem 12 generates the following data from the digitized input melody 20, which is used to construct the output differential note and timing file 150:

-   a list of breakpoints, which indicate the boundaries between distinct notes in the input melody; and
-   a list of pitches, each pitch being associated with each note between two adjacent breakpoints.

In addition, the subsystem 12 determines one or more confidence levels related to breakpoints and/or notes, and uses one or more of those confidence levels in the construction of the differential note and timing files 150. Specifically, in the preferred embodiment, a confidence measure or level is associated with each breakpoint that indicates the probability that the breakpoint is valid. A confidence measure or level may also be associated with each identified note, which indicates the likelihood that the identified note does not contain more than one note.

Breakpoints are intended to indicate points of silence or points of inflection (i.e., alteration in pitch or tone of the voice) in the input melody. The embodiments described herein use more than one technique to identify a breakpoint and determine its confidence level by considering how “closely” the various techniques have collectively identified a breakpoint. For example, if all techniques have identified a breakpoint at the same particular point in the input melody, the confidence level associated with that breakpoint is relatively high. Conversely, if fewer than all of the techniques identify a breakpoint at or near that particular point in the melody, the confidence level will be lower.

In the illustrated embodiment of FIG. 2, three tonal characteristics are considered in identifying breakpoints:

-   silence, or conversely regions of the input waveform containing humming (represented by output arrow 60);
-   changes in pitch (represented by output arrow 50); and
-   changes or differences in the distribution of energy across the frequency spectrum over time (represented by output arrow 90).

The first two characteristics should be intuitively understood for their value in identifying a breakpoint. The last item is a breakpoint characteristic due to the typical nature of human vocalization. More particularly, as mentioned, users can hum melodies using notes which are combinations of a semi-vowel, such as “l”, and a vowel, such as “a”, i.e., “la”. When enunciating the semi-vowel, it has been found that the mouth is typically actuated in such a way that results in the sound energy being concentrated at lower frequencies, as compared with the frequency distribution of the sound energy during the vowel. The preferred embodiment takes advantage of this observation, as discussed in greater detail below.

Notes are defined between two adjacent breakpoints. The embodiments described herein can use one or more than one technique to determine a confidence level associated with each note, which indicates the likelihood that the note contains only one note from the input melody. Because this is equivalent to the confidence that a breakpoint was not missed inside the note, the note confidence measures can be derived from the same quantities as used for breakpoint confidence measures, except with an inverse relationship. For example, a large and rapid change in pitch near a breakpoint increases the confidence in that breakpoint. However, large and rapid changes in pitch in the interval between two breakpoints decrease the confidence that a breakpoint has not been missed. As with breakpoint confidence measures, note confidence measures may be based on one or more different indicators.

2.2. Detailed Discussion

One set of processing steps of the subsystem 12 begins by filtering the input melody 20 (alternatively referred to as the “input waveform”) with a bandpass filter 25 in order to attenuate frequency components that lie outside the range of expected pitches.

Next, a framer 30 segments the filtered input waveform into a sequence of “frames” of equal period, e.g., 1/32 of a second. Each frame contains a short portion of the total filtered input waveform. Adjacent frames may contain overlapping parts of the filtered input waveform to provide for some degree of continuity therebetween, as known in the art per se. The overlap is preferably a tunable parameter and may be expressed as a percentage. Every part of the filtered input waveform is thus represented in at least one frame.
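The framing step can be sketched as follows. This is only an illustration under stated assumptions: the function name `split_into_frames`, the zero-padding of the final frame, and the default overlap of 50% are hypothetical choices that the specification does not prescribe.

```python
def split_into_frames(samples, frame_length, overlap_fraction=0.5):
    """Split samples into equal-length frames that overlap by a tunable fraction.

    Assumption: the last frame is zero-padded so that every sample is
    represented in at least one frame.
    """
    if not 0.0 <= overlap_fraction < 1.0:
        raise ValueError("overlap_fraction must be in [0, 1)")
    # Number of samples to advance between the starts of consecutive frames.
    hop = max(1, int(round(frame_length * (1.0 - overlap_fraction))))
    frames = []
    start = 0
    while start < len(samples):
        frame = list(samples[start:start + frame_length])
        frame += [0.0] * (frame_length - len(frame))  # pad a short final frame
        frames.append(frame)
        if start + frame_length >= len(samples):
            break  # the end of the waveform has been covered
        start += hop
    return frames

# Ten samples, frames of four samples, 50% overlap.
frames = split_into_frames(list(range(10)), frame_length=4, overlap_fraction=0.5)
```

For a real recording, a frame of 1/32 of a second would correspond to `frame_length = sample_rate // 32` samples.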

The auto-correlation of each frame is then computed at block 35. The auto-correlation $c[l]$ of a waveform $x[n]$ is defined as the sequence $c[l] = \sum_{k=-\infty}^{\infty} x[k]\,x[l+k].$ This provides a measure of the similarity of a signal with a shifted version of itself, where the amount of shift is given by $l$. The auto-correlation is related to the spectral energy distribution of $x[n]$. The auto-correlation computation will yield a multitude of auto-correlation coefficients for each frame. As known in the art, peaks in the auto-correlation provide an indication of the periodicity or pitch of a waveform, which in this case is the part of the filtered input waveform contained in each frame.
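A minimal sketch of the auto-correlation of a single frame is given below. It assumes that samples outside the frame are treated as zero, so the infinite sum reduces to a finite one; the function names and the synthetic test signal are hypothetical.

```python
import math

def autocorrelation(frame):
    """c[l] = sum_k x[k] * x[l + k], with x taken as zero outside the frame."""
    n = len(frame)
    return [sum(frame[k] * frame[k + lag] for k in range(n - lag))
            for lag in range(n)]

def normalized_autocorrelation(frame):
    """Auto-correlation divided by c[0], so that the value at zero shift is 1."""
    c = autocorrelation(frame)
    if c[0] <= 0.0:
        return [0.0] * len(c)  # an all-zero frame has no periodicity
    return [value / c[0] for value in c]

# Synthetic frame: a sinusoid with a period of 20 samples.
frame = [math.sin(2 * math.pi * n / 20) for n in range(200)]
c = normalized_autocorrelation(frame)

# Search a plausible range of shifts for the strongest peak.
best_lag = max(range(10, 31), key=lambda lag: c[lag])
```

The strongest peak in the searched range falls at the true period of the sinusoid, which is the property the pitch estimator relies on.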

Block 45 provides a frame-by-frame pitch estimate 50. This is carried out by first identifying the “largest” peaks in the auto-correlation of each frame, e.g., the top 2-10 auto-correlation values. This yields a number of “pitch period candidates”. The estimated pitch period of the frame is determined by selecting the pitch period candidate that corresponds to a large auto-correlation peak while simultaneously considering how “close” the pitch period candidate is to pitch period estimates in one or more adjacent frames. The adjacent frames may be preceding or succeeding frames, or both. The preferred embodiment employs a cost function which weights the size of the auto-correlation peaks and the closeness of the corresponding pitch period candidates to pitch period estimates in adjacent frames. This analysis presumes that the human vocal tract cannot radically alter pitch in the short time period represented by a frame, e.g., 1/32 second. If no such pitch period can be found from among the possible pitch period candidates, the pitch measurement block 45 labels that frame as containing no pitch. In this manner the possibility that there is no reliable pitch in the frame is also considered.

For example, let $p_i$ be the pitch period in frame $i$, where $p_i$ is either one of the identified pitch period candidates or a value indicating the lack of any identified pitch. An example cost function is $\sum_{i} \left\{ D(p_{i-1}, p_i) + \left( 1 - c[p_i] \right) \right\}$ where the sum is taken over all frames in the input melody. The function $D(p_{i-1}, p_i)$ measures the difference between adjacent pitch period estimates, for example $D(p_{i-1}, p_i) = \alpha \left| \ln(p_i) - \ln(p_{i-1}) \right|$ where $\alpha = 2/\ln 2$. If either $p_i$ or $p_{i-1}$ indicates that there is no identified pitch, $D(p_{i-1}, p_i)$ is set equal to a constant, e.g., 4. The value $c[p_i]$ is the normalized autocorrelation at the shift corresponding to $p_i$. If $p_i$ indicates that there is no identified pitch, then we assign $c[p_i] = 0$. The exact sequence of pitch period candidates minimizing this cost function can be computed by a dynamic programming procedure similar to that described in Section 3 on the note matching engine.
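The minimization of the example cost function can be sketched with a Viterbi-style dynamic program. This is an illustrative sketch rather than the procedure of Section 3: the function names, the use of `None` to mark "no identified pitch", the omission of the transition term for the first frame, and the candidate values in the example are all assumptions.

```python
import math

ALPHA = 2.0 / math.log(2.0)   # alpha = 2 / ln 2, so an octave jump costs 2
NO_PITCH_PENALTY = 4.0        # constant used when either frame has no pitch
NO_PITCH = None               # hypothetical marker for "no identified pitch"

def transition_cost(prev_period, period):
    """D(p_{i-1}, p_i) from the example cost function."""
    if prev_period is NO_PITCH or period is NO_PITCH:
        return NO_PITCH_PENALTY
    return ALPHA * abs(math.log(period) - math.log(prev_period))

def track_pitch(candidates_per_frame):
    """Pick one pitch period (or NO_PITCH) per frame with minimum total cost.

    candidates_per_frame: one list per frame of (period, normalized_autocorr).
    """
    # Every frame may also be labelled as having no pitch, with c = 0.
    states = [list(frame) + [(NO_PITCH, 0.0)] for frame in candidates_per_frame]
    if not states:
        return []
    costs = [1.0 - c for _, c in states[0]]  # assumption: no D term for frame 0
    back_pointers = []
    for i in range(1, len(states)):
        new_costs, pointers = [], []
        for period, c in states[i]:
            totals = [costs[j] + transition_cost(states[i - 1][j][0], period)
                      for j in range(len(states[i - 1]))]
            best_j = min(range(len(totals)), key=totals.__getitem__)
            new_costs.append(totals[best_j] + (1.0 - c))
            pointers.append(best_j)
        back_pointers.append(pointers)
        costs = new_costs
    # Trace the cheapest path backwards from the last frame.
    j = min(range(len(costs)), key=costs.__getitem__)
    path = [states[-1][j][0]]
    for i in range(len(states) - 1, 0, -1):
        j = back_pointers[i - 1][j]
        path.append(states[i - 1][j][0])
    path.reverse()
    return path

# Frame 1 has a slightly stronger peak at an octave error (period 200).
example = [
    [(100, 0.90), (50, 0.80)],
    [(200, 0.92), (101, 0.90)],
    [(102, 0.90), (51, 0.85)],
]
smooth_track = track_pitch(example)
```

In this example the continuity term outweighs the marginally larger auto-correlation peak at period 200, so the selected track stays near a period of 100 samples.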

Block 55 seeks to detect regions 60 of the input waveform containing useful sound such as humming or music (as opposed to silence or noise), based on the frame-by-frame pitch estimate 50 and the frame-based auto-correlation 40. The manner in which this is preferably carried out is exemplified in FIG. 3. In FIG. 3, each position along the horizontal axis represents a frame, with the “P” line 56 representing input pitch estimates 50 (FIG. 2) and the “E” line 57 representing the energy of each frame, as determined from the frame-based auto-correlation 40 (FIG. 2). In this example (FIG. 3), the pitch estimates and energy estimates have quantized values ranging from 1-9. The sound detection block 55 first looks for regions that may have useful sound because a valid pitch estimate was computed in the block 45. This is shown in line “S1” 58 of FIG. 3 where the symbol ‘H’ represents useful sound. Next, the sound detection block 55 considers the average energy of the frames in each region. Where the average energy is below a specified threshold, the region is considered to have no useful sound. This is shown in line “S2” 59 of FIG. 3. In the illustrated example, the block 55 thus considers region 60B of the input waveform as being silent. Conversely, regions 60A and 60C are considered to contain useful sound. Regions containing useful sound are sent to a breakpoint detection block 100.
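The two-stage sound detection can be sketched as below. The function name, the use of `None` for a frame without a valid pitch estimate, the threshold, and the sample values are hypothetical; the values are not those of FIG. 3.

```python
def detect_sound_regions(pitches, energies, energy_threshold):
    """Return (start, end) frame ranges, end exclusive, judged to hold useful sound.

    Stage 1: group consecutive frames that have a valid pitch estimate.
    Stage 2: discard groups whose average frame energy is below the threshold.
    """
    regions = []
    start = None
    # A trailing None acts as a sentinel that closes a run reaching the end.
    for i, pitch in enumerate(list(pitches) + [None]):
        if pitch is not None and start is None:
            start = i
        elif pitch is None and start is not None:
            mean_energy = sum(energies[start:i]) / (i - start)
            if mean_energy >= energy_threshold:
                regions.append((start, i))
            start = None
    return regions

# Quantized pitch and energy per frame; None marks "no valid pitch estimate".
pitches = [None, 5, 5, 6, None, 2, 2, None, 7, 7, 7]
energies = [1, 6, 7, 6, 1, 2, 1, 1, 8, 7, 6]
regions = detect_sound_regions(pitches, energies, energy_threshold=4)
```

Here the middle run of pitched frames is rejected because its average energy is low, which mirrors how a low-energy region is treated as silent in the description above.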

The breakpoint detection block 100 (FIG. 2) also receives input from a parallel processing path comprising a high-pass filter 65, a framer 70 and a spectral energy distribution indicator computation block 75. The high-pass filter 65 filters the input waveform 20 in order to emphasize high frequency information that has been found to be useful in detecting the breakpoints between notes. The framer 70 slices the filtered input waveform into frames, which are identical in scope and temporal position to the frames generated by framer 30.

The spectral energy distribution (“SED”) indicator computation block 75 computes a numerical measure or SED indicator 90, which indicates how the sound energy is distributed in each frame. The SED indicator preferably assumes relatively high values if the sound energy is concentrated near high frequencies and relatively low values if the sound energy is concentrated near other frequencies, as described in greater detail below. For example, a 4 kHz frequency range may be considered with high frequencies deemed to be those approaching 4 kHz and low frequencies deemed to be those near zero kHz.

The breakpoint detection block 100 finds initial estimates for the locations of note breakpoints (i.e., “candidate” breakpoints) 105 and computes a confidence measure 110 associated with each candidate breakpoint 105. This confidence measure varies between 0 and 1, where a value near 1 indicates that the breakpoint is very reliable.

The breakpoint detection block 100 operates on regions of the input waveform supplied by the sound detection block 55. In FIG. 3, for example, these would be regions 60A and 60C. The detection block 100 assigns breakpoints to the beginning and end frames of each region. Thus, the transition between a frame with no pitch estimate and a frame with a valid pitch estimate is one indication that may be used to identify breakpoints. These breakpoints are given a confidence level of 1. This is exemplified in FIG. 3 by the “x” symbol in the “B” line 101.

Within each region, the block 100 detects candidate breakpoints based on minima present in the SED indicator 90. These are exemplified in FIG. 3 by the “^” symbol in the “B” line 101. The reason for this can be understood on an intuitive level by considering a melody waveform that consists of a sequence of notes, each of which is sung as “la.” The vowel part “a” is typically longer in duration than the consonant part and is usually better defined spectrally. Therefore, it should provide the most reliable information for pitch extraction. Segmentation can be performed if the “l” part of each “la” can be detected. Because “l” is a semivowel, it typically contains strong pitch periodicity. However, because of the constriction of the mouth during production, it contains less overall energy and less high frequency resonant structure.

This can also be seen through experimental observation. For example, FIG. 4A is a spectrogram of a normalized energy spectrum for a sample melody hummed using “la” notes. (Note that FIG. 4A relates to a sample melody that differs from that shown in FIG. 3.) More particularly, the normalized energy spectrum is shown as a gray scale image wherein normalized energy values approaching a maximum value are white and values near zero are black. The vertical axis of the spectrogram corresponds to frequency and the horizontal axis corresponds to time. Thus, a vertical cross-section of the spectrogram essentially corresponds to one frame and represents the normalized energy spectrum of the frame as a function of frequency. The energy spectrum of a frame is defined as the squared magnitude of the Discrete Fourier Transform of that frame, and always assumes a non-negative value. The normalized energy spectrum of a frame is obtained by normalizing the energy spectrum of the frame by the total energy in the frame; i.e., the sum of the energy spectrum over all frequencies in the frame.

A strong banding structure (i.e., generally horizontal white lines) exists between frame nos. 50 and 350. The rest is basically noise. The bands are harmonics (multiples) of the pitch frequency and move closer and farther apart as the pitch changes. The dominant band in each frame is not the pitch frequency, but some harmonic of it. Which harmonic is emphasized depends strongly upon the shape of the vocal tract and mouth at that time instant.

There are about ten notes in FIG. 4A with the breakpoints being indicated by the vertical white lines 160 in the image. (Lines 160 are not part of the spectrogram but are merely used to indicate the position of the breakpoints in the image.) Breakpoints between notes can be seen where the dominant band shifts lower because constrictions in the vocal tract reduce the amount of high frequency energy uttered. This is shown more clearly in FIGS. 4B and 4C. FIG. 4B shows the normalized energy spectrum plotted versus frequency for frame no. 150, which is near a breakpoint. FIG. 4C shows the same kind of plot for frame no. 170, which is in the middle of a hummed note. A shift in the energy distribution to higher frequencies is clearly evident. Note that these plots are essentially a cross-section through a vertical slice of the spectrogram illustrated in FIG. 4A.

The SED indicator 90 represents the shift in energy distribution. This is a numerical measure which combines the spectral energies in each frame in such a way that the value of that measure is large if the energy distribution is concentrated in certain frequency bands and small if the energy distribution is concentrated in others. There are a variety of ways to compute the SED indicator.

In one implementation, the SED indicator 90 can be computed as the first moment of the energy spectrum in each frame divided by the zeroth moment. More particularly, let $X(k)$ be the energy spectrum at frequency bin $k$; the corresponding spectral energy distribution measure is given by $\frac{\sum_{k} k\,X(k)}{\sum_{k} X(k)}.$ The summation is carried out over all frequency bins from 0 (DC) up to the frequency bin corresponding to the Nyquist frequency. Frequency bins past the Nyquist frequency contain no additional information due to aliasing. This results in large values if $X(k)$ is concentrated around large frequencies (large $k$). The graph of FIG. 5B plots the SED indicator (when computed as just described) for the sample input melody of FIG. 4A, i.e., for all frames. In FIG. 5B, vertical lines 162 indicate the positions of breakpoints. FIG. 5A repeats FIG. 4A to facilitate comparison. Note that the SED indicator drops to minimum values at or near the breakpoints.
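The first-moment form of the SED indicator can be sketched as follows. The function names and the synthetic frames are hypothetical, and a direct (slow) Discrete Fourier Transform is used only to keep the sketch self-contained; a practical implementation would presumably use a fast transform.

```python
import cmath
import math

def energy_spectrum(frame):
    """Squared magnitude of the DFT for bins 0 (DC) up to the Nyquist bin."""
    n = len(frame)
    spectrum = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(frame))
        spectrum.append(abs(coeff) ** 2)
    return spectrum

def sed_indicator(frame):
    """First moment of the energy spectrum divided by its zeroth moment."""
    spectrum = energy_spectrum(frame)
    total = sum(spectrum)
    if total == 0.0:
        return 0.0  # assumption: a silent frame is given an indicator of zero
    return sum(k * energy for k, energy in enumerate(spectrum)) / total

# Two synthetic 64-sample frames: energy at bin 3 versus energy at bin 20.
n = 64
low_frame = [math.sin(2 * math.pi * 3 * t / n) for t in range(n)]
high_frame = [math.sin(2 * math.pi * 20 * t / n) for t in range(n)]
low_value = sed_indicator(low_frame)
high_value = sed_indicator(high_frame)
```

Because each synthetic frame holds a single sinusoid centred on one bin, the indicator comes out close to that bin index, and it is larger for the frame whose energy sits at the higher frequency.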

Based on the SED indicator 90, the breakpoint detection block 100 preferably derives a “minimum measure” at each frame, which is positive if there is a local minimum in the SED indicator “near” the corresponding frame, and zero otherwise. In this context the number of “near” frames is, for example, 15 frames before and 15 frames after the present frame. By considering or integrating such information over a number of frames the SED indicator can be smoothed. The amplitude of the minimum measure is larger the “deeper” the local minimum. A linear relationship is preferably employed for determining the amplitude of the minimum measure but other types of relationships can be employed in the alternative, such as power, exponential, and logarithmic relationships. FIG. 5C shows an example of the minimum measure for the SED indicator shown in FIG. 5B. In FIG. 5C, vertical lines 164 indicate the positions of breakpoints. It will be seen from FIG. 5C that the minimum measure takes into consideration the relative depth of the local minima of the SED indicator in comparison to the surrounding plateau, and also smoothes the SED indicator to eliminate the many peaks and valleys after frame no. 350. The breakpoint detection block 100 uses the minimum measure to determine candidate breakpoints by finding the locations of the positive peaks therein.
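The text does not give an exact formula for the minimum measure, so the following Python sketch shows only one possible reading. It assumes that a frame receives a positive value only when it is itself the minimum of its ±`radius` window, and that the depth is measured linearly against the window mean (standing in for the "surrounding plateau"). The names `minimum_measure` and `candidate_breakpoints` are hypothetical.

```python
from typing import List, Sequence


def minimum_measure(sed: Sequence[float], radius: int = 15) -> List[float]:
    """Depth of each local SED minimum relative to its neighborhood mean.

    Assumption: only the frame holding the window minimum gets a non-zero
    value; all other frames are set to zero.
    """
    n = len(sed)
    measure = [0.0] * n
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = sed[lo:hi]
        if sed[i] == min(window):
            plateau = sum(window) / len(window)
            measure[i] = plateau - sed[i]  # linear in the depth, never negative
    return measure


def candidate_breakpoints(measure: Sequence[float]) -> List[int]:
    """Frames where the minimum measure is positive (its peaks in this sketch)."""
    return [i for i, value in enumerate(measure) if value > 0]


sed = [5.0, 5.0, 5.0, 1.0, 5.0, 5.0, 5.0, 5.0, 2.0, 5.0, 5.0]
measure = minimum_measure(sed, radius=2)
print(candidate_breakpoints(measure))
```

In this toy input the dips at frames 3 and 8 are reported, and the deeper dip receives the larger measure. A flat region yields a depth of zero and is therefore not reported.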

If desired, an additional or alternative method for identifying breakpoints is to determine locations of rapid changes in the valid pitch estimates across frames. The rate of change in pitch at a given frame can be determined from examination of the pitch changes in surrounding frames. For example, if p_(i) is the pitch estimate at frame i, the rate of change can be estimated as being proportional to $\sum\limits_{k = - r}^{r}{k\, p_{i + k}}$ where the parameter r determines the size of the neighborhood in the past and future frames used to determine the pitch change. An example choice might be r=3. Larger values are less influenced by noise or pitch mis-estimations, but on the other hand will have less temporal resolution.
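The weighted sum above is easy to check numerically. The sketch below (the function name `pitch_change_rate` is hypothetical, and frames closer than r to either end are simply rejected, which is a choice made here) returns the unnormalized sum, so the result is proportional to the slope rather than equal to it.

```python
from typing import Sequence


def pitch_change_rate(pitch: Sequence[float], i: int, r: int = 3) -> float:
    """Unnormalized slope estimate: sum over k = -r..r of k * p[i + k]."""
    if i - r < 0 or i + r >= len(pitch):
        # Assumption: edge frames without a full neighborhood are not handled.
        raise IndexError("frame i needs r valid neighbors on each side")
    return sum(k * pitch[i + k] for k in range(-r, r + 1))


steady = [60.0] * 10                     # constant pitch
rising = [2.0 * t for t in range(10)]    # pitch rising by 2 units per frame
print(pitch_change_rate(steady, 5), pitch_change_rate(rising, 5))
```

For a linear pitch track with slope a, the sum equals a times the sum of k² over the window (28 for r=3), so a constant pitch gives zero and a steeper track gives a proportionally larger value.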

The confidence measure 110 for each candidate breakpoint is preferably a weighted sum of four numbers. The first number is large if the absolute value of the SED indicator is “small” in the neighborhood of the breakpoint, e.g., less than about 75% of the average value over the input waveform. The second number is large if the minimum measure in the vicinity of the breakpoint is “large”, e.g., larger than about 80% of the maximum value over the input waveform. The third number is large if the rate of change of pitch at the breakpoint is “large”, e.g., more than about 10 semitones per second. The fourth number is large if the average energy in frames around the breakpoint is “small”, e.g., less than 50% of the maximum value in some neighborhood around the candidate breakpoint. Preferably, each of these numbers is weighted equally, although a variety of weightings may be used in the alternative.

At block 115, only those breakpoint candidates 105 with confidence measures 110 exceeding a certain “threshold”, e.g., 0.45, are retained. This yields the final note breakpoints 125, which delineate the notes and their beat durations, and final confidence measures 120.
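The weighted sum and the thresholding step can be sketched as follows. The text only says each number is "large" or not, so this sketch makes the loud assumption that each of the four numbers is a crisp 0/1 indicator using the example percentages quoted above, with equal weights of 0.25. The class `BreakpointFeatures` and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class BreakpointFeatures:
    sed_ratio: float          # SED near the breakpoint / average SED over the input
    min_measure_ratio: float  # minimum measure near the breakpoint / its maximum
    pitch_rate: float         # rate of change of pitch, semitones per second
    energy_ratio: float       # average energy near the breakpoint / local maximum


def breakpoint_confidence(
    f: BreakpointFeatures,
    weights: Tuple[float, float, float, float] = (0.25, 0.25, 0.25, 0.25),
) -> float:
    """Weighted sum of four indicators (assumed binary in this sketch)."""
    indicators = (
        1.0 if f.sed_ratio < 0.75 else 0.0,
        1.0 if f.min_measure_ratio > 0.80 else 0.0,
        1.0 if abs(f.pitch_rate) > 10.0 else 0.0,
        1.0 if f.energy_ratio < 0.50 else 0.0,
    )
    return sum(w * x for w, x in zip(weights, indicators))


def retain(candidates: Sequence[BreakpointFeatures],
           threshold: float = 0.45) -> List[BreakpointFeatures]:
    """Keep only candidates whose confidence exceeds the threshold."""
    return [c for c in candidates if breakpoint_confidence(c) > threshold]


strong = BreakpointFeatures(0.5, 0.9, 12.0, 0.3)
weak = BreakpointFeatures(0.9, 0.85, 2.0, 0.8)
kept = retain([strong, weak])
```

With binary indicators and equal weights, the example threshold of 0.45 amounts to requiring at least two of the four conditions; graded indicators would behave more smoothly.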

At block 115, a confidence measure 122 is also associated with each note identified between two breakpoints 125. This confidence measure is designed to indicate the likelihood that the identified note does not contain more than one note from the input melody due to a missed breakpoint in the breakpoint detection block 100. The note confidence measure 122 is a weighted sum of four numbers. The first number is small if the variation of the SED indicator for frames within the note is “large”, e.g., the difference between the maximum and minimum value is greater than some percentage (e.g., 20%) of the average value. The second number is small if the maximum “minimum measure” taken over all frames in the note is large, e.g., greater than 20% of the maximum value over the input waveform. The third is small if the variation of the identified pitch periods for frames inside the note is large, e.g., the maximum and minimum values vary by more than one semitone. The fourth is small if the variation in the energy level for frames inside the note is “large”, e.g., the difference between the maximum and minimum value is larger than some percentage (e.g., 20%) of the average value. Note that the dependence of the note confidence measure on the SED indicator, minimum measure and identified pitch is opposite that for the breakpoint confidence measure. This is because the breakpoint confidence measure indicates the confidence that a breakpoint was not mistakenly added. On the other hand, the note confidence measure indicates the confidence that a breakpoint was not mistakenly deleted.

At block 130, the pitch for each note is determined by merging the pitch periods across the frames falling between two breakpoints delineating that note. It is preferred to merge the pitch by finding the median of the pitch estimates between the two breakpoints. The median computation is less sensitive than the average to occasional large errors in the pitch estimates at individual frames. This yields the note pitch 135.
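A small Python illustration of why the median is preferred: one frame with an octave error barely moves the median but shifts the mean considerably. The function name `note_pitch` and the half-open frame range are assumptions made for this sketch.

```python
from statistics import mean, median
from typing import Sequence


def note_pitch(frame_pitches: Sequence[float], start: int, end: int) -> float:
    """Median pitch of the frames in [start, end) between two breakpoints."""
    return median(frame_pitches[start:end])


# Five frames of one note; the fourth frame carries an octave error.
frames = [220.0, 221.0, 219.0, 440.0, 220.0]
robust = note_pitch(frames, 0, len(frames))
naive = mean(frames)
print(robust, naive)
```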

The differential note and timing file 150 is generated by block 140. The pitch ratio and the beat duration ratio are each expressed as the base-2 logarithm of the ratio between two consecutive notes, as follows:

-   ${RF}_{i} = {\log_{2}\left( \frac{F_{i + 1}}{F_{i}} \right)}$ where F_(i) and F_(i+1) are the pitch frequencies of notes i and i+1, respectively,
-   ${RT}_{i} = {\log_{2}\left( \frac{T_{i + 1}}{T_{i}} \right)}$ where T_(i) and T_(i+1) are the beat durations of notes i and i+1, respectively.
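As a brief sketch of the two ratios, the Python function below (its name `differential_notes` is hypothetical) turns lists of note frequencies and beat durations into (RF, RT) pairs. The example also illustrates that transposing the melody or changing its tempo leaves the ratios unchanged, which is the reason relative values are useful for matching.

```python
import math
from typing import List, Sequence, Tuple


def differential_notes(
    freqs: Sequence[float], durations: Sequence[float]
) -> List[Tuple[float, float]]:
    """Return (RF_i, RT_i) = (log2(F[i+1]/F[i]), log2(T[i+1]/T[i])) per note pair."""
    return [
        (math.log2(freqs[i + 1] / freqs[i]),
         math.log2(durations[i + 1] / durations[i]))
        for i in range(len(freqs) - 1)
    ]


original = differential_notes([220.0, 440.0, 330.0], [1.0, 0.5, 1.0])
# Same melody sung a fifth higher and twice as slowly.
shifted = differential_notes([330.0, 660.0, 495.0], [2.0, 1.0, 2.0])
print(original[0], shifted[0])
```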

From the foregoing segmentation and pitch estimation, useful information has been extracted about the input melody. However, this information may contain errors as the user may vocalize or play some notes in an incorrect pitch or with incorrect beat duration. The note-matching engine has the capability and flexibility to tolerate errors in both notes and beats, as discussed next.

3. Note-Matching Engine

The note-matching engine 16 (FIG. 1) is a score-based engine. It generates a score for each song in the reference database 14 based on the similarity of the input melody to the songs in the database, taking into account the confidence levels of each identified breakpoint and each extracted note. By using dynamic programming the engine 16 attempts to compensate for errors generated either by the user, who may have vocalized or played the melody with wrong notes or wrong beats, or by the melody-note conversion subsystem 12, which may miss some notes, over-count notes or measure the note duration incorrectly.

Instead of using absolute beat and note information, the preferred embodiment uses relative beat and note information for the matching process because the user may vocalize or play the melody in any scale, and not necessarily the 12-tone octave scale. Similarly, the user may vocalize or play the melody in any tempo. Therefore, relative pitch and beat data is preferred.

The inputs to the matching engine 16 are the differential note and timing file 150 and candidate differential note and timing files from the music database 14. To compensate for the insertion and deletion errors caused by the user or the conversion subsystem 12, it is desirable to find the likelihood of matching instead of an exact match between two files. This problem is similar to the classical longest common subsequence problem in which two strings are given and a maximum length common subsequence of these two strings is found. The note-matching engine 16 employs a dynamic programming approach described below to solve this problem in an optimal manner.

The engine 16 sets up a 2-dimensional matrix 180 for each song matching, as exemplified in FIG. 10. The Y-axis of the matrix 180 represents a string Y=(Y₁,Y₂, . . . ,Y_(m)) from the differential note and timing file of the candidate song where each entry Y_(i) is a tuple or vector (YRF_(i), YRT_(i)). YRF represents the pitch ratio and YRT represents the beat duration ratio of the corresponding entry. The X-axis of the matrix 180 represents a string X=(X₁,X₂, . . . ,X_(n)) from the differential note and timing file 150 generated by the note conversion subsystem 12. Each entry X_(i) is a 4-tuple or vector (XRF_(i), XRT_(i), XICON_(i), XDCON_(i)) where XRF_(i), XRT_(i), XICON_(i) and XDCON_(i) represent the pitch ratio, the beat duration ratio, the confidence level of the note breakpoint, and the confidence level of the note preceding the breakpoint, respectively.

The cost of a matching between an entry in X and an entry in Y is defined as the weighted sum of the absolute differences between the corresponding RF and RT values. For example, the cost of matching Y_(j) and X_(i) is equal to

match_cost(X_(i), Y_(j)) = α|YRF_(j) − XRF_(i)| + β|YRT_(j) − XRT_(i)|

where α and β are the relative weights of pitch and beat duration ratios, respectively. The cost reflects the error of matching X_(i) with Y_(j). If an entry X_(i) in X is perfectly matched with an entry Y_(j) in Y, the cost of match is equal to zero. The objective of the song-matching algorithm is to find the subsequence of Y with the minimum matching cost with X. The score of matching is thus the cost of matching. The lower the score, the better the match. If there is no insertion or deletion error in the input differential note and timing file 150, then the cost of matching the string (X₁, X₂, . . . , X_(n)) with a sub-string (Y_(j), . . . , Y_(j+n−1)) in Y is given by the following recursive formula:

min_match_cost((X₁, X₂, . . . , X_(n)), (Y_(j), . . . , Y_(j+n−1))) = match_cost(X_(n), Y_(j+n−1)) + min_match_cost((X₁, X₂, . . . , X_(n−1)), (Y_(j), . . . , Y_(j+n−2))).

In practice, the j index may range from 1 to m−n+1 (where m is the total number of notes in Y). The lowest value of min_match_cost( ) (as j ranges from 1 to m−n+1) is selected as the score for the candidate song.
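For the error-free case just described, the recursion unrolls into a plain sum over aligned notes, evaluated at every starting offset j. The following Python sketch illustrates that case only; the names `match_cost` and `exact_alignment_score` and the default weights α=β=1 are assumptions, and indices are zero-based here rather than one-based as in the text.

```python
from typing import Sequence, Tuple

Note = Tuple[float, float]  # (pitch ratio RF, beat duration ratio RT)


def match_cost(x: Note, y: Note, alpha: float = 1.0, beta: float = 1.0) -> float:
    """alpha * |YRF - XRF| + beta * |YRT - XRT|."""
    return alpha * abs(y[0] - x[0]) + beta * abs(y[1] - x[1])


def exact_alignment_score(
    X: Sequence[Note], Y: Sequence[Note], alpha: float = 1.0, beta: float = 1.0
) -> float:
    """Lowest cost of aligning X with any contiguous sub-string of Y (no gaps)."""
    n, m = len(X), len(Y)
    if n == 0 or n > m:
        raise ValueError("need 0 < len(X) <= len(Y)")
    return min(
        sum(match_cost(X[i], Y[j + i], alpha, beta) for i in range(n))
        for j in range(m - n + 1)
    )


song = [(1.0, 0.0), (-1.0, 0.5), (2.0, -0.5), (0.5, 0.0)]
perfect = exact_alignment_score([(-1.0, 0.5), (2.0, -0.5)], song)
off_pitch = exact_alignment_score([(-1.0, 0.5), (2.5, -0.5)], song)
print(perfect, off_pitch)
```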

However, insertion or deletion errors may happen. To compensate for this the engine 16 allows for matching with note insertions or note deletions. If there is an insertion before note X_(n), the cost of matching is given by:

min_match_cost((X₁, X₂, . . . , X_(n)), (Y_(j), . . . , Y_(j+n−1))) = match_cost(X_(n), Y_(j+n−1)) + min_match_cost((X₁, X₂, . . . , X_(n−2)), (Y_(j), . . . , Y_(j+n−2)))

For k number of insertions, the cost of matching is given by:

min_match_cost((X₁, X₂, . . . , X_(n)), (Y_(j), . . . , Y_(j+n−1))) = match_cost(X_(n), Y_(j+n−1)) + min_match_cost((X₁, X₂, . . . , X_(n−k−1)), (Y_(j), . . . , Y_(j+n−2)))

If there is a deletion before the note X_(n), the cost of matching is given by:

min_match_cost((X₁, X₂, . . . , X_(n)), (Y_(j), . . . , Y_(j+n−1))) = match_cost(X_(n), Y_(j+n−1)) + min_match_cost((X₁, X₂, . . . , X_(n−1)), (Y_(j), . . . , Y_(j+n−3)))

For k number of deletions, the cost of matching is given by:

min_match_cost((X₁, X₂, . . . , X_(n)), (Y_(j), . . . , Y_(j+n−1))) = match_cost(X_(n), Y_(j+n−1)) + min_match_cost((X₁, X₂, . . . , X_(n−1)), (Y_(j), . . . , Y_(j+n−k−2))).

Insertion and deletion are not the norm. So, the matching process, although it allows for insertion and deletion, also adds a penalty term when the engine 16 tries to match notes assuming there are k insertions or deletions. However, the conversion subsystem 12 provides a confidence level for every breakpoint and every note that indicates how likely the breakpoint is a “correct” breakpoint and how likely the note is a “correct” note. A low breakpoint confidence level means that the transition is likely to be a wrong transition and hence may result in an insertion error. So a low breakpoint confidence level also implies the note is likely to be an insertion. A low note confidence level means that the note is likely to be composed of several notes whose breakpoints were mistakenly deleted. Therefore, when a low note confidence level is encountered, there is a higher chance that a deletion error occurred. (In other words, the breakpoint confidence level reflects note insertion error and the note confidence level reflects note deletion error.) Hence, if the note is matched assuming there is an insertion or deletion error, the penalty should be lowered. For this reason the engine 16 adjusts the penalty by weighting it with the breakpoint and note confidence levels. A breakpoint that is associated with a low confidence level is more likely to be an insertion and hence incurs a lower penalty during matching for note insertion. A note that is associated with a low confidence level is more likely to be a deletion and hence incurs a lower penalty during matching for note deletion. The above min_match_cost calculations are updated as follows:

For k insertions:

$\mathrm{min\_match\_cost}\left( \left( X_{1},X_{2},\ldots,X_{n} \right),\left( Y_{j},\ldots,Y_{j + n - 1} \right) \right) = \mathrm{match\_cost}\left( X_{n},Y_{j + n - 1} \right) + \mathrm{min\_match\_cost}\left( \left( X_{1},X_{2},\ldots,X_{n - k - 1} \right),\left( Y_{j},\ldots,Y_{j + n - 2} \right) \right) + \sum\limits_{i = 1}^{k}\left( \text{penalty of the } i^{th} \text{ insertion} \right)*\left( \text{the corresponding } i^{th} \text{ breakpoint's confidence level} \right)$

For k deletions:

min_match_cost((X₁, X₂, . . . , X_(n)), (Y_(j), . . . , Y_(j+n−1))) = match_cost(X_(n), Y_(j+n−1)) + min_match_cost((X₁, X₂, . . . , X_(n−1)), (Y_(j), . . . , Y_(j+n−k−2))) + (penalty of k deletions) * (the corresponding note's confidence level)

Based on the recursive structure of the minimum matching cost calculation, a dynamic programming approach is used to implement the note-matching algorithm. FIG. 10 illustrates the above cost calculation. This figure shows the first 4×4 matrix for a song in a database being compared against a four-note hummed melody. The note matching engine 16 operates in a reverse direction, i.e., the last note of the hummed melody is considered first against the latest notes of the song. For each matrix point, the engine 16 seeks a preceding note having the lowest cost, which translates into highest similarity in relative pitch, relative beat duration and confidence level. At matrix point (4,4) the engine 16 considers the following possibilities:

Direction  Matrix Point  Meaning                                                                Cost Calculation
(−1, −1)   (3, 3)        Notes and beats are normal sequence, i.e., no insertions or deletions  α|64.0−64.3| + β|0.75−0.80|
(−1, −2)   (3, 2)        Note is missing in hummed melody                                       α|62.0−64.3| + β|0.5−0.80| + (cost of 1 deletion) * 0.7
(−1, −3)   (3, 1)        Two notes are missing in hummed melody                                 α|60−64.3| + β|0.5−0.8| + (cost of 2 deletions) * 0.7
(−2, −1)   (2, 3)        Extra note is added/inserted in hummed melody                          α|64−62| + β|0.75−0.5| + (cost of 1st insertion) * 0.8
(−3, −1)   (1, 3)        Two extra notes added/inserted in hummed melody                        α|64−60.2| + β|0.75−0.52| + (cost of 1st insertion) * 0.8 + (cost of 2nd insertion) * 0.6
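The dynamic programming search over these possibilities can be sketched in Python as below. This is a simplified illustration, not the engine 16 itself, and it rests on several assumptions that are not in the text: it fills the matrix in the forward direction instead of the reverse direction described above, it charges a constant `ins_penalty` per insertion and `k * del_penalty` for k deletions, it lets the melody start at any song note, and all names are hypothetical. Hummed notes are (XRF, XRT, XICON, XDCON) tuples and song notes are (YRF, YRT) tuples.

```python
from typing import List, Sequence, Tuple

HummedNote = Tuple[float, float, float, float]  # (XRF, XRT, XICON, XDCON)
SongNote = Tuple[float, float]                  # (YRF, YRT)


def match_cost(x: HummedNote, y: SongNote, alpha: float, beta: float) -> float:
    return alpha * abs(y[0] - x[0]) + beta * abs(y[1] - x[1])


def song_score(X: Sequence[HummedNote], Y: Sequence[SongNote],
               alpha: float = 1.0, beta: float = 1.0,
               ins_penalty: float = 1.0, del_penalty: float = 1.0,
               max_k: int = 2) -> float:
    """Lowest cost of matching all of X against Y, allowing up to max_k
    consecutive insertions or deletions with confidence-weighted penalties."""
    n, m = len(X), len(Y)
    if n == 0 or m == 0:
        raise ValueError("X and Y must be non-empty")
    inf = float("inf")
    cost: List[List[float]] = [[inf] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            local = match_cost(X[i], Y[j], alpha, beta)
            if i == 0:
                cost[i][j] = local  # the melody may start anywhere in the song
                continue
            if j == 0:
                continue            # no preceding song note to pair with
            best = cost[i - 1][j - 1]               # k = 0: normal sequence
            extra = 0.0
            for k in range(1, max_k + 1):           # k hummed notes inserted
                if i - 1 - k < 0:
                    break
                extra += ins_penalty * X[i - k][2]  # weighted by XICON
                best = min(best, cost[i - 1 - k][j - 1] + extra)
            for k in range(1, max_k + 1):           # k song notes deleted
                if j - 1 - k < 0:
                    break
                weighted = k * del_penalty * X[i - 1][3]  # weighted by XDCON
                best = min(best, cost[i - 1][j - 1 - k] + weighted)
            cost[i][j] = local + best
    return min(cost[n - 1])


song = [(1.0, 0.0), (-1.0, 0.5), (2.0, -0.5), (0.5, 0.0)]
clean = [(-1.0, 0.5, 0.9, 0.9), (2.0, -0.5, 0.9, 0.9)]
with_extra = [(-1.0, 0.5, 0.9, 0.9), (7.0, 3.0, 0.2, 0.9), (2.0, -0.5, 0.9, 0.9)]
print(song_score(clean, song), song_score(with_extra, song))
```

In the second query the spurious middle note has a low breakpoint confidence (0.2), so skipping it costs only 0.2 rather than the full insertion penalty, which mirrors the weighting argument made above.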

It will thus be seen from the foregoing that at a matching point (X_(i), Y_(j)) in the matrix formed by X and Y, the engine 16 searches for a preceding set of notes {(X_(i−1−k), Y_(j−1)), (X_(i−1), Y_(j−1−k))}, 0≦k≦max_(k), which minimize a match cost defined as follows:

If k = 0,

$\alpha\left| {YRF}_{j - 1} - {XRF}_{i - 1} \right| + \beta\left| {YRT}_{j - 1} - {XRT}_{i - 1} \right|$,

else if k > 0,

$\alpha\left| {YRF}_{j - 1} - {XRF}_{i - 1 - k} \right| + \beta\left| {YRT}_{j - 1} - {XRT}_{i - 1 - k} \right| + \sum\limits_{m = 0}^{k - 1}\left( \text{penalty for the } \left( m + 1 \right)^{th} \text{ insertion} \right)*{XICON}_{i - 1 - m}$

or

$\alpha\left| {YRF}_{j - 1 - k} - {XRF}_{i - 1} \right| + \beta\left| {YRT}_{j - 1 - k} - {XRT}_{i - 1} \right| + \left( \text{penalty for } k \text{ deletions} \right)*{XDCON}_{i - 1}$

where α and β are weights.

4. Utility

The melody retrieval system 10 can be used in, but is not limited to, the following applications:

-   Intelligent user interface of a music jukebox. Thousands of songs are typically stored inside a typical jukebox. To select a song from this large database is sometimes not an easy task using the traditional input method. Using the melody retrieval system 10, the user can hum a few notes and the system will search through all the songs stored in the jukebox and then output the songs that most closely match the humming melody. The user can then pick the song he or she wants. This system can be extended beyond the jukebox application to many consumer audio entertainment products that store songs, such as portable music players like MD™ player, Walkman™, Discman™, MP3™ portable player, and others.
-   A tool to search for a song or music piece on the Internet. The music retrieval system 10 can be used as an Internet music-searching engine similar to a conventional text-based web page searching engine. The user hums the melody and the tool can initiate a search in an online music database. Such a tool can also preferably spawn multiple searches in multiple databases. The results of the parallel search can be consolidated, sorted and output according to the matching scores. The output may also be a hypertext link such that the user can directly select the song he or she wants and connect to the web site that stores the song for purchasing or downloading.
-   A tool to help cellular phone users to download songs from cellular phone or wireless content providers. The next generation mobile phones (e.g. 3G cellular phone) not only support high bit rate transmission but also have the local digital signal processing power to decode digitally-compressed audio format such as MP3. However, it may be difficult for users to select songs from the mobile phone due to the small numerical keypad interface and small LCD screen. The music retrieval system 10 can be employed to enable the user to hum a melody into the mobile phone which can then transmit the input melody back to the base station where the system 10 is preferably located. Once the note-matching engine 16 finishes the matching process the output subsystem 18 can transmit a list of the top-ranked songs back to the user. The list can be displayed on the screen of the mobile phone for the user to select the song to download.
-   Password protection. Rather than having a text-based password protection mechanism to access user accounts and the like, a query based on humming can be employed in the alternative.

5. Variants

One aspect of the invention is concerned with estimating or determining breakpoints based on changes in the spectral energy distribution of the input melody. One implementation of the SED indicator has been described. There are alternative ways of computing the SED indicator which nevertheless yield similar properties to the above-described implementation. One broad class of SED indicators is defined by $\frac{\sum\limits_{k}{{f(k)}{g\left( {X(k)} \right)}}}{\sum\limits_{k}{g\left( {X(k)} \right)}}$ where f(k) and g(X(k)) are non-negative and non-decreasing functions of k and X(k), respectively. The previously described implementation of the SED indicator used f(k)=k and g(X(k))=X(k). However, other choices can produce similar results. For example:

-   FIGS. 6A-6C show plots similar to FIGS. 5A-5C where the SED indicator is defined according to $\frac{\sum\limits_{k}{\sqrt{k}{X(k)}}}{\sum\limits_{k}{X(k)}},$ i.e., f(k)=√k and g(X(k))=X(k).
-   FIGS. 7A-7C show plots similar to FIGS. 5A-5C where the SED indicator is defined according to $\frac{\sum\limits_{k}{k^{2}{X(k)}}}{\sum\limits_{k}{X(k)}},$ i.e., f(k)=k² and g(X(k))=X(k).
-   FIGS. 8A-8C show plots similar to FIGS. 5A-5C where the SED indicator is defined according to $\frac{\sum\limits_{k}{{\sin\left( \frac{\pi k}{2K} \right)}{X(k)}}}{\sum\limits_{k}{X(k)}},$ where K is the frequency bin corresponding to the Nyquist frequency, i.e., ${f(k)} = {\sin\left( \frac{\pi k}{2K} \right)}$ and g(X(k))=X(k).
-   FIGS. 9A-9C show plots similar to FIGS. 5A-5C where the SED indicator is defined according to $\frac{\sum\limits_{k}{k\,{X(k)}^{2}}}{\sum\limits_{k}{X(k)}^{2}},$ i.e., f(k)=k and g(X(k))=X(k)².
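The general class and the four listed variants can be expressed with one Python function that takes f and g as parameters. This is a sketch under assumptions: the name `generalized_sed` is hypothetical, the last list element is taken to be the Nyquist bin K, and an all-zero frame is mapped to 0.0.

```python
import math
from typing import Callable, Sequence


def generalized_sed(
    X: Sequence[float],
    f: Callable[[int], float] = float,
    g: Callable[[float], float] = lambda x: x,
) -> float:
    """sum_k f(k) g(X(k)) / sum_k g(X(k)) over bins 0 .. Nyquist."""
    denominator = sum(g(x) for x in X)
    if denominator == 0:
        return 0.0  # assumption: silent frames map to 0.0
    return sum(f(k) * g(x) for k, x in enumerate(X)) / denominator


low = [4.0, 3.0, 2.0, 1.0, 0.0]    # energy concentrated in low bins
high = [0.0, 1.0, 2.0, 3.0, 4.0]   # energy concentrated in high bins
K = len(low) - 1                   # assumed index of the Nyquist bin

variants = {
    "f(k)=k": dict(),
    "f(k)=sqrt(k)": dict(f=lambda k: math.sqrt(k)),
    "f(k)=k^2": dict(f=lambda k: float(k * k)),
    "f(k)=sin(pi k / 2K)": dict(f=lambda k: math.sin(math.pi * k / (2 * K))),
    "g(X)=X^2": dict(g=lambda x: x * x),
}
results = {name: (generalized_sed(low, **kw), generalized_sed(high, **kw))
           for name, kw in variants.items()}
```

Each variant ranks the high-frequency frame above the low-frequency frame, which is consistent with the statement that these choices produce similar results.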

In the preferred embodiment, the SED indicator is defined so that it achieves large values if the energy spectrum is concentrated in high frequencies and small values if the energy spectrum is concentrated at low frequencies. An inverse relationship may be employed. Also, alternative embodiments may choose different frequency ranges, such as achieving large values within a band or bands of frequencies and low values outside that band or bands. This might be done to differentiate other types of breakpoints.

It should also be appreciated that the SED indicator need not be computed from the energy spectrum. For example, the SED indicator illustrated in FIG. 5B could be computed by estimating the slope at the origin of the auto-correlation of each frame and normalizing that slope by the value at the origin. This is due to the fact that the auto-correlation and the energy spectrum are Fourier Transform pairs, and thus contain the same information.

Many examples have been given for the various parameters used in the spectral analysis techniques discussed herein. This is done for the purpose of illustration only and is not intended to be limiting. Without limiting the generality of the foregoing, these examples include: the “largest” auto-correlation peaks; the “closeness” of pitch candidates to other pitch estimates; the “nearness” of frames to a local minimum in the SED indicator; the “depth” of the local minimum; the “smallness” of the absolute values of the SED indicators; the “largeness” of the minimum measure in the vicinity of a breakpoint; the “largeness” of the rate of change of pitch; and the “threshold” for the confidence measure. As will be appreciated by those skilled in this art, a wide range of crisp values can be used to implement what are essentially fuzzy logic concepts.

The preferred embodiment has been presented in a system block diagram format, but in practice the invention may be implemented in software or hardware, or a combination of both. Similarly, those skilled in the art will understand that numerous other modifications and variations may be made to the embodiments disclosed herein without departing from the spirit or scope of the invention.

1. A method for converting a digitized melody into a sequence of notes, comprising: segmenting said melody into a series of frames; computing a spectral energy distribution (SED) indicator for each frame; and estimating initial breakpoints in said melody based on said SED indicator, said notes being defined between adjacent initial breakpoints.
2. A method according to claim 1, wherein the value of said SED indicator for a given frame is relatively large if an energy distribution associated with said frame is concentrated in one or more specified frequency bands.
3. A method according to claim 2, including filtering said melody with a high pass filter prior to segmenting said melody into said frames.
4. A method according to claim 3, wherein said energy distribution is determined from a normalized energy spectrum of said frame.
5. A method according to claim 3, wherein said specified frequency band is the upper portion of a 0 to 4 kHz range.
6. A method according to claim 3, wherein the SED indicator is defined as $\frac{\sum\limits_{k}{{f(k)}{g\left( {X(k)} \right)}}}{\sum\limits_{k}{g\left( {X(k)} \right)}},$ where X(k) is the energy spectrum of a frame at frequency bin k and f(k) and g(X(k)) are non-negative and non-decreasing functions of k and X(k), respectively.
7. A method according to claim 6, wherein the SED indicator is defined as $\frac{\sum\limits_{k}{k\, X(k)}}{\sum\limits_{k}{X(k)}}.$
8. A method according to claim 6, wherein the SED indicator is defined as $\frac{\sum\limits_{k}{\sqrt{k}{X(k)}}}{\sum\limits_{k}{X(k)}}.$
9. A method according to claim 6, wherein the SED indicator is defined as $\frac{\sum\limits_{k}{k^{2}{X(k)}}}{\sum\limits_{k}{X(k)}}.$
10. A method according to claim 6, wherein the SED indicator is defined as $\frac{\sum\limits_{k}{{\sin\left( \frac{\pi k}{2K} \right)}{X(k)}}}{\sum\limits_{k}{X(k)}},$ where K is the frequency bin corresponding to the Nyquist frequency.
11. A method according to claim 6, wherein the SED indicator is defined as $\frac{\sum\limits_{k}{k\,{X(k)}^{2}}}{\sum\limits_{k}{X(k)}^{2}}.$
12. A method according to claim 3, wherein the auto-correlation of each said frame is computed and said SED indicator is computed by estimating the slope at the origin of the frame's auto-correlation and normalizing that slope by the value at the origin.
13. A method according to claim 1, including estimating the pitch of each said frame.
14. A method according to claim 13, wherein estimating the pitch of each frame comprises: computing the auto-correlation of each said frame; and estimating the pitch of each said frame by selecting a pitch period corresponding to a shift where the auto-correlation coefficient associated with the frame is relatively large.
15. A method according to claim 1, including estimating the pitch of each said note between adjacent initial breakpoints.
16. A method according to claim 15, wherein estimating the pitch of each note between initial breakpoints comprises: computing the auto-correlation of each said frame; estimating the pitch of each said frame by selecting a pitch period corresponding to a shift where the auto-correlation coefficient associated with the frame is relatively large; and averaging or taking the median of the pitch estimates of frames between adjacent breakpoints.
17. A method according to claim 15, including associating each said initial breakpoint with a confidence level, which is influenced by at least one of (a) the degree in the change or rate of change of pitch in the frames around the initial breakpoints, and (b) the value of said SED indicator in the vicinity of the initial breakpoint.
18. A method according to claim 17, wherein the confidence level is further influenced by the energy level of said melody in the vicinity of the initial breakpoint.
19. A method according to claim 17, including eliminating from consideration initial breakpoints associated with confidence levels below a specified threshold, thereby identifying breakpoints in said melody.
20. A method according to claim 19, including estimating the pitch and beat duration of each said note between said breakpoints.
21. A method according to claim 1, wherein the melody is a voice-hummed melody composed of a series of uttered semi-vowels.
22. Apparatus for converting a digitized melody into a sequence of notes, comprising: means for segmenting said melody into a series of frames; means for computing a spectral energy distribution (SED) indicator for each frame; and means for estimating initial breakpoints in said melody based on said SED, said notes being defined between adjacent initial breakpoints.
23. Apparatus according to claim 22, wherein the value of said SED indicator for a given frame is relatively large if an energy distribution associated with said frame is concentrated in one or more specified frequency bands.
24. Apparatus according to claim 23, including filtering said melody with a high pass filter prior to segmenting said melody into said frames.
25. Apparatus according to claim 24, wherein said energy distribution is determined from a normalized energy spectrum of said frame.
26. Apparatus according to claim 24, wherein said specified frequency band is the upper portion of a 0 to 4 kHz range.
27. A method for converting a digitized melody into a sequence of notes, comprising: segmenting said melody into a series of frames; computing the auto-correlation of each said frame; estimating the pitch of each said frame based on (i) a pitch period corresponding to a shift where the auto-correlation coefficient associated with the frame is relatively large and (ii) the closeness of the pitch estimate to estimates in one or more adjacent frames; and estimating breakpoints in said melody based on changes in said pitch estimates, said notes being defined between adjacent breakpoints.
28. A method according to claim 27, wherein said breakpoints are estimated based on a rate of change of said pitch estimates.
29. A method according to claim 27, including filtering said melody with a band pass filter prior to segmenting the melody into frames.
30. A method according to claim 27, including estimating the pitch of each note by selecting the average or median pitch of the frames falling within a pair of breakpoints.
31. A method according to claim 27, wherein the melody is a voice-hummed melody.
32. Another aspect of the invention provides a method for identifying breakpoints in a digitized melody, the method comprising: segmenting the melody into a series of frames; computing the auto-correlation of each frame; estimating the pitch of each frame based on (i) a pitch period corresponding to a shift where the auto-correlation coefficient associated with the frame is relatively large and (ii) the closeness of the pitch estimate to estimates in one or more adjacent frames; determining regions of said melody where pitch estimates are likely to be invalid; and identifying said breakpoints in the melody based on transitions between frames having valid pitch estimates and transitions having invalid pitch estimates.
33. A method according to claim 32, wherein said breakpoints are estimated based on a rate of change of said pitch estimates.
34. A method according to claim 32, including filtering said melody with a band pass filter prior to segmenting the melody into frames.
35. A method according to claim 32, including estimating the pitch of each note by selecting the average or median pitch of the frames falling within a pair of breakpoints.
36. A method according to claim 32, wherein the melody is a voice-hummed melody.
37. Apparatus for converting a digitized melody into a sequence of notes, comprising: means for segmenting said melody into a series of frames; means for computing the auto-correlation of each said frame; means for estimating the pitch of each said frame based on (i) a pitch period corresponding to a shift where the auto-correlation coefficient associated with the frame is relatively large and (ii) the closeness of the pitch estimate to estimates in one or more adjacent frames; means for determining regions of said melody where pitch estimates are likely to be invalid; and means for estimating breakpoints in said melody based on changes in said pitch estimates or transitions between frames having valid pitch estimates and frames having no pitch estimates, said notes being defined between adjacent breakpoints.
38. A method of retrieving at least one entry from a music database, wherein each said entry is associated with a sequence of pitches and beat durations, said method comprising: receiving a digitized representation of an input melody; identifying breakpoints in said melody in order to define notes therein, each said notes being delineated by adjacent breakpoints; assigning a confidence level to each note or each breakpoint; determining a pitch and beat duration for each note of said melody; determining a score for each said entry based on a search which minimizes the cost of matching the pitches and beat durations of said melody and said entry, wherein said search considers at least one deletion or insertion error in a selected note of said melody and, in this event, penalizes the cost of matching based on the confidence level of the selected note or a breakpoint associated therewith; and presenting said at least one entry to a user based on its score.
39. A method according to claim 38, wherein said pitches and beat durations are relative pitches and relative beat durations.
 40. A method according to claim 38, wherein the cost of matching a given note X_(i) of said melody with a given note Y_(j) associated with said entry is: match_cost(X_(i),Y_(j))=α|YRF_(j)−XRF_(i)|+β|YRT_(j)−XRT_(i)|, where YRF_(j) and YRT_(j) respectively represent the relative pitch and relative beat duration of the note associated with said entry; XRF_(i) and XRT_(i) respectively represent the relative pitch and relative beat duration of the note associated with said melody; and α and β are weights.
 41. A method according to claim 38, wherein: a confidence level is assigned to each note and each breakpoint; and said search considers deletion and insertion errors for any given note of said melody and, in this event, penalizes the cost of matching based on the confidence level of the given note and the confidence level of a breakpoint associated with the given note.
 42. A method according to claim 41, wherein: X is a sequence of notes, X_(i), of said melody, each X_(i) having components XRF_(i), XRT_(i), XICON_(i), and XDCON_(i) which respectively represent the relative pitch, relative beat duration, confidence level of the breakpoint and confidence level of the note associated with said melody; Y is a sequence of notes, Y_(j), of said entry, each Y_(j) having components YRF_(j) and YRT_(j) which respectively represent the relative pitch and relative beat duration of the note associated with said entry; X and Y form a matrix, and at a matching point (X_(i), Y_(j)) said search seeks to identify a preceding set of notes {(X_(i−1−k), Y_(j−1)), (X_(i−1), Y_(j−1−k))}, 0≦k≦max_(k), which minimize a match cost defined as follows:
 if k = 0,
 $\alpha\left|YRF_{j-1} - XRF_{i-1}\right| + \beta\left|YRT_{j-1} - XRT_{i-1}\right|,$
 else if k > 0,
 $\alpha\left|YRF_{j-1} - XRF_{i-1-k}\right| + \beta\left|YRT_{j-1} - XRT_{i-1-k}\right| + \sum\limits_{m=0}^{k-1}\left(\text{penalty for the } (m+1)^{th} \text{ insertion}\right) \cdot XICON_{i-1-m}$
 or
 $\alpha\left|YRF_{j-1-k} - XRF_{i-1}\right| + \beta\left|YRT_{j-1-k} - XRT_{i-1}\right| + \left(\text{penalty for } k \text{ deletions}\right) \cdot XDCON_{i-1},$
 where α and β are weights.
 43. Apparatus for retrieving at least one entry from a music database, wherein each said entry is associated with a sequence of pitches and beat durations, said apparatus comprising: means for receiving a digitized representation of an input melody; a melody-to-note conversion subsystem for identifying breakpoints in said melody in order to define notes therein, said subsystem determining a pitch and beat duration for each note of said melody and associating each note or each breakpoint with a confidence level; a note-matching engine for determining a score for each said entry based on a search which minimizes the cost of matching the pitches and beat durations of said melody and said entry, wherein said search considers at least one deletion or insertion error in a selected note of said melody and, in this event, penalizes the cost of matching based on the confidence level of the selected note or a breakpoint associated therewith; and an output subsystem for presenting said at least one entry to a user based on its score.
 44. A method of retrieving at least one entry from a music database, wherein each said entry is associated with a sequence of pitches and beat durations, said method comprising: receiving a digitized representation of an input melody; identifying breakpoints in said melody in order to define notes therein, each said note being delineated by adjacent breakpoints; associating a confidence level with each note pertaining to the likelihood that said note contains a note insertion error; determining a pitch and beat duration for each note of said melody; determining a score for each said entry based on a search which minimizes the cost of matching the pitches and beat durations of said melody and said entry, wherein said search considers at least one insertion error in a selected note of said melody and, in this event, penalizes the cost of matching based on the confidence level associated with the selected note; and presenting said at least one entry to a user based on its score.
 45. A method of retrieving at least one entry from a music database, wherein each said entry is associated with a sequence of pitches and beat durations, said method comprising: receiving a digitized representation of an input melody; identifying breakpoints in said melody in order to define notes therein, each said note being delineated by adjacent breakpoints; associating a confidence level with each note pertaining to the likelihood that said note contains a note deletion error; determining a pitch and beat duration for each note of said melody; determining a score for each said entry based on a search which minimizes the cost of matching the pitches and beat durations of said melody and said entry, wherein said search considers at least one deletion error in a selected note of said melody and, in this event, penalizes the cost of matching based on the confidence level associated with the selected note; and presenting said at least one entry to a user based on its score.
 46. A method for determining confidence levels for breakpoints or notes in a waveform representing a melody, the method comprising: segmenting the waveform into a series of frames, wherein adjacent breakpoints encompass one or more sequential frames; executing at least two of the following three steps: (a) computing a spectral energy distribution (SED) indicator for each frame, (b) estimating the pitch of each frame, and (c) determining the energy level of each frame; and deriving the confidence levels based on at least two of the following three characteristics: (i) the SED indicator, (ii) changes in pitch, and (iii) the energy level.
 47. A method according to claim 46, wherein the confidence level for a given breakpoint is computed as a weighted combination of at least two of three numbers, the first number based on the value of the SED indicator in the vicinity of the given breakpoint, the second number being based on a change in pitch in the frames before and after the given breakpoint, and the third number being based on the energy level of the frames in the immediate vicinity of the breakpoint.
 48. A method according to claim 46, wherein the confidence level for a given note is computed as a weighted combination of at least two of three numbers, the first number based on the value of the SED indicator in the given note, the second number being based on the variation in pitch in the given note, and the third number being based on the energy level of the frames in the given note.
 49. A method for determining confidence levels for breakpoints or notes in a waveform representing a melody, the method comprising: segmenting the waveform into a series of frames, wherein adjacent breakpoints encompass one or more sequential frames; computing a spectral energy distribution (SED) indicator for each frame; estimating the pitch of each frame; and deriving the confidence levels based on the SED indicator and changes in pitch.
 50. A method according to claim 49, wherein the confidence level for a given breakpoint is computed as a weighted combination of a first number based on the value of the SED indicator in the vicinity of the given breakpoint and a second number based on a change in pitch in the frames before and after the given breakpoint.
 51. A method according to claim 49, wherein the confidence level for a given note is computed as a weighted combination of a first number based on the value of the SED indicator within the given note and a second number based on the variation in pitch within the given note.
 52. A method according to claim 49, wherein the value of the SED indicator for a given frame is relatively large if an energy distribution associated with the frame is concentrated in one or more specified frequency bands.
 53. A method according to claim 52, including filtering the melody with a high pass filter prior to segmenting the melody into frames.
 54. A method according to claim 53, wherein the energy distribution is determined from a normalized energy spectrum of the frame.
 55. A method according to claim 54, wherein the specified frequency band is in the upper portion of a 0-4 kHz frequency range.
 56. A method for determining confidence levels for breakpoints or notes in a waveform representing a melody, the method comprising: segmenting the waveform into a series of frames, wherein adjacent breakpoints encompass one or more sequential frames; computing a spectral energy distribution (SED) indicator for each frame; determining the energy level of each frame; and deriving the confidence levels based on the SED indicator and the energy level.
 57. A method according to claim 56, wherein the confidence level for a given breakpoint is computed as a weighted combination of a first number based on the value of the SED indicator in the vicinity of the given breakpoint and a second number based on the energy level of the frame in the immediate vicinity of the breakpoint.
 58. A method according to claim 56, wherein the confidence level for a given note is computed as a weighted combination of a first number based on the value of the SED indicator in the given note and a second number based on the energy level of the frames in the given note.
 59. A method according to claim 56, wherein the value of the SED indicator for a given frame is relatively large if an energy distribution associated with the frame is concentrated in one or more specified frequency bands.
 60. A method according to claim 59, including filtering the melody with a high pass filter prior to segmenting the melody into frames.
 61. A method according to claim 60, wherein the energy distribution is determined from a normalized energy spectrum of the frame.
 62. A method according to claim 61, wherein the specified frequency band is the upper portion of a 0-4 kHz frequency range.
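
By way of illustration only, and without limiting the claims, the confidence-weighted match cost recited in claims 40 and 42 might be sketched in Python as follows. The class names, function names, zero-based list indexing, the per-insertion penalty list, and the linear per-deletion penalty are all assumptions made for this sketch; the claims do not specify how the penalties themselves are chosen.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MelodyNote:
    rf: float    # XRF: relative pitch of the melody note
    rt: float    # XRT: relative beat duration of the melody note
    icon: float  # XICON: confidence level of the associated breakpoint
    dcon: float  # XDCON: confidence level of the note itself


@dataclass
class EntryNote:
    rf: float  # YRF: relative pitch of the database entry note
    rt: float  # YRT: relative beat duration of the database entry note


def match_cost(x: MelodyNote, y: EntryNote,
               alpha: float = 1.0, beta: float = 1.0) -> float:
    """Weighted pitch and duration mismatch between one melody note and one entry note."""
    return alpha * abs(y.rf - x.rf) + beta * abs(y.rt - x.rt)


def insertion_step_cost(xs: Sequence[MelodyNote], ys: Sequence[EntryNote],
                        i: int, j: int, k: int,
                        ins_penalties: Sequence[float],
                        alpha: float = 1.0, beta: float = 1.0) -> float:
    """Cost of reaching (X_i, Y_j) from (X_{i-1-k}, Y_{j-1}), treating k melody notes as inserted.

    ins_penalties[m] is the (assumed) penalty for the (m+1)-th insertion; each
    penalty is scaled by the breakpoint confidence XICON_{i-1-m}.
    """
    base = match_cost(xs[i - 1 - k], ys[j - 1], alpha, beta)
    return base + sum(ins_penalties[m] * xs[i - 1 - m].icon for m in range(k))


def deletion_step_cost(xs: Sequence[MelodyNote], ys: Sequence[EntryNote],
                       i: int, j: int, k: int,
                       penalty_per_deletion: float,
                       alpha: float = 1.0, beta: float = 1.0) -> float:
    """Cost of reaching (X_i, Y_j) from (X_{i-1}, Y_{j-1-k}), treating k notes as deleted.

    The penalty for k deletions is assumed to grow linearly with k and is
    scaled by the note confidence XDCON_{i-1}.
    """
    base = match_cost(xs[i - 1], ys[j - 1 - k], alpha, beta)
    return base + penalty_per_deletion * k * xs[i - 1].dcon


# Small made-up example: three melody notes, two entry notes.
xs = [MelodyNote(0, 1, 0.5, 0.5), MelodyNote(2, 1, 0.2, 0.8), MelodyNote(-1, 2, 0.9, 0.4)]
ys = [EntryNote(0, 1), EntryNote(1, 2)]
ins_cost = insertion_step_cost(xs, ys, i=3, j=2, k=1, ins_penalties=[1.0, 2.0])
del_cost = deletion_step_cost(xs, ys, i=3, j=2, k=1, penalty_per_deletion=1.5)
```

With k = 0 both step functions reduce to the plain match cost of claim 40, so a search over the matrix of claim 42 could take the minimum of these step costs over 0≦k≦max_(k) at each matching point.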